Abstract Recent years have witnessed increasing competition in science.
While it promotes the quality of research in many cases, intense competition
among scientists can also trigger unethical scientific behavior. To increase the
total number of published papers, some authors even resort to software tools
that produce grammatical but meaningless scientific manuscripts.
Because automatically generated papers can be mistaken for real papers,
it is of paramount importance to develop means to identify these
scientific frauds. In this paper, I devise a methodology to distinguish real
manuscripts from those generated with SCIGen, an automatic paper generator.
Upon modeling texts as complex networks (CN), it was possible to
discriminate real from fake papers with an accuracy of at least 89%. A systematic
analysis of feature relevance revealed that the accessibility and betweenness
were useful in particular cases, even though the relevance depended upon the
dataset. The successful application of the methods described here shows, as
a proof of principle, that network features can be used to identify scientific
gibberish papers. In addition, the CN-based approach can be combined in a
straightforward fashion with traditional statistical language processing
methods to improve the performance in identifying artificially generated papers.

Keywords scientific frauds • SCIgen • complex networks • plagiarisms 


1 Introduction 

The dissemination of knowledge and the advancement of science strongly
depend upon the precise interpretation of the content conveyed in scientific
manuscripts. Therefore, the ideas conveyed by high-quality scientific papers
should be carefully detailed so that they can be tested and possibly improved.
Although many qualitative aspects have been proposed to identify outstanding
manuscripts and their respective authors, quantitative aspects still
prevail when academic merit is assessed. For example, in
recent years, the total number of articles, the number of citations received
by articles and the researchers’ h-index have been widely used for the purpose of
merit evaluation. Clearly, there is a correlation between quantitative and
qualitative factors. Nevertheless, the drawbacks related to the exclusive
use of quantitative factors are well known. For example, recent publications
cannot be evaluated via citation counts. Likewise, very young researchers
cannot be assessed according to the number of citations that their articles
receive, since there is an expected delay between the publication date of a
paper and its wide scientific recognition [3].


While there are several disadvantages associated with quantitative indices,
they unmistakably provide a minimal degree of objectivity required for any
scientific merit assessment. Aware of the prevalence of quantitative indices
in scientific merit judgments, some scholars tend to shape their research only
to increase their citation counts and other quantitative indices. The
pressure imposed by scientific competition, summarized by the maxim “publish or
perish”, urges a few scholars not to follow good scientific practices.
In order to artificially boost productivity and impact indices, some authors
split the results arising from a single discovery into two or more papers. For
these reasons, several low-quality scientific papers with a very weak impact on
science have been produced. Other recurrent unethical conducts include the
excessive use of self-citations [8,9] and plagiarism [10-12]. More recently, even
automatically generated texts have been submitted and, surprisingly, deemed
suitable for publication in several scientific conferences. Currently, one of
the most popular software tools for generating fake papers is SCIGen [14], an
algorithm able to produce gibberish papers that resemble real manuscripts. To
produce such meaningless texts, SCIGen uses a complex grammar that is capable
of generating texts containing all the features expected in a standard scientific
manuscript. To complement the grammar, even figures and tables are generated.
An incremental modification of the original algorithm has implemented
the possibility of self-citations, which has allowed a significant increase in
authors’ h-index [15]. Because fake papers such as those generated by SCIGen can
eventually bewilder even a human referee, it is of paramount relevance
to identify particular features able to discriminate real from meaningless
texts. In this context, I focus on one approach to identify distinct styles
in texts that has proven particularly effective to detect SCIGen texts. More
specifically, using a representation of texts as complex networks [16], I show
that it is possible to discriminate real and fake manuscripts with significant
accuracy if one analyzes the structural organization of the manuscripts. Even
though the accuracy of the proposed technique does not outperform other
traditional methods based on the analysis of textual content, it is useful to show
that the structural patterns of text organization are affected when fake
information is conveyed.

This manuscript is organized as follows. In Section 2, I present related
approaches aiming at the identification of fake scientific manuscripts. A very
short introduction to the application of complex networks for text analysis is
presented in Section 3. The methods employed for the representation,
characterization and classification of text networks are presented in Section 4. The
results obtained with both univariate and multivariate analysis of network
measurements are presented in Section 5. Finally, the conclusions drawn from
the results and the perspectives for further studies are discussed in the last
section.


2 Related works 


Several methods have been devised to verify the authenticity of scientific
manuscripts. Such methods can be classified according to the type of
information that is employed as features of classifiers. Usually, the list of references
plays an important role in the task. For example, it has been shown that when
many cited references cannot be found online, there is a high probability
that the paper under analysis is fake [17].

Many heuristics rely on textual content to infer the authenticity of
documents [18,19]. The study developed in [20] proposes some useful rules. One
of the main rules tests whether keywords in the title and abstract occur
frequently in the body of the paper. If such a pattern does not occur, then the
document is considered fake. Another interesting observation highlights that
real scientific papers usually mention keywords from the titles of cited papers.
Techniques based on the semantic content of texts also employ traditional
similarity measurements [21]. An important contribution to the problem of
identifying gibberish publications with similarity measurements was introduced
in [22]. In their study, the authors devised a pairwise similarity measurement
that compares two pieces of text by counting differences in word frequencies.
This approach was useful to identify several cases of duplicate and fake
publications. An extension of this work was proposed in [23], where not only single
word occurrences are considered, but also multi-word phrases. Other
interesting approaches relying on textual content include the techniques based on the
compressibility rate of texts [24,25]. In [25], the authors show that artificially
generated papers display compressibility rates that are not compatible
with the rates observed in real manuscripts.

Differently from the approaches mentioned in this section, the method
proposed in this paper does not consider the semantic similarity of texts. Instead,
the approach proposed here focuses on the analysis of connectivity patterns that
are able to capture subtleties of style in texts from distinct sources. Because
the proposed approach is complementary to other traditional techniques, it
could potentially be useful to improve the reliability of the classification.


3 Complex networks and text analysis


Complex networks have been employed to model a myriad of real complex
systems [16]. Of particular interest to the aims of this study are the
applications in automatic summarization [26,27], machine translation [28,29],
complexity analysis [30-34] and authorship recognition [35,36]. In all these tasks,
the networks obtained from the so-called word adjacency model (see Section
4.1) allowed a precise characterization of texts with regard to specific
textual features. Networked models even allowed the characterization of unknown
manuscripts [37] and many other linguistic aspects [38-42]. Interestingly, the
particular features of each language could also be classified in terms of the
topological structure of complex networks [43-45]. A more detailed survey on
the application of network methods in text analysis can be found in [46].
Differently from traditional approaches, the method proposed here focuses on the
structure and organization of texts, rather than on the textual semantic
content. The proposed method also differs from the other techniques mentioned
here because it is modified in order to minimize the influence of the
vocabulary size on the topological analysis of scientific articles (see Section 5).


4 Methodology 


The methodology employed to compare real and fake manuscripts is
illustrated in Figure 1. First, the text of each scientific article is obtained by
automatically removing the LaTeX tags from the original texts. Because some
undesirable tokens can still remain in the text, the output is checked manually.
Following previous studies, the style of each text is quantified via the topological
characterization of complex networks (graphs) [37,47,48]. For this
reason, the texts are modeled as complex networks. Then, several connectivity
measurements are extracted from the networks. In the next step, the
measurements are employed as features to discriminate real and fake manuscripts. The
discrimination is accomplished with pattern recognition methods. The main
steps shown in Figure 1 are described below.


4.1 Modeling texts as complex networks 


A network can be defined as a set of nodes connected by edges. To represent
a network, consider that $A = \{a_{ij}\}$ is the matrix representing the network
structure. In texts, each distinct word is a node and edges are established
between adjacent words. Therefore, if words $i$ and $j$ appear adjacent in the
text, the element $a_{ij}$ is set to one ($a_{ij} = 1$). Otherwise, $a_{ij} = 0$.
The total number of links of a node, i.e. the node degree, is defined as
$k(i) = \sum_j a_{ij}$. In several style-based applications, some pre-processing
steps are usually applied before the connection of adjacent words [49]. The
pre-processing algorithm encompasses a two-fold mechanism: (a) the
identification and removal of stopwords; and (b) the lemmatization. In (a), words
conveying low semantic content (e.g. “and”, “of”, “a”, “an”) are removed. Since
these words are simply used to connect content words, they can be
straightforwardly replaced by edges. In (b), each remaining word is mapped to its
canonical form [21]. As a consequence, conjugated verbs are mapped to their
infinitive forms. Likewise, nouns are mapped to their singular forms. In order
to obtain word lemmas [21], it is imperative to know in advance the
part-of-speech of words. In this study, each word was labeled with its
part-of-speech using a maximum entropy model [50].

Fig. 1 Sequence of methods employed to distinguish gibberish from real scientific
manuscripts (diagram: LaTeX format, text format, filtered manuscript,
pre-processed manuscript, network, topological characterization, pattern
recognition). The actions taken in each step are: (1) tags and mathematical terms
are stripped out; (2) the manuscript is manually checked in order to verify if its
content includes only textual information; (3) lemmatization and removal of
stopwords; (4) mapping of a text into a network; (5) extraction of complex network
measurements; (6) discrimination of distinct classes (real or fake) via machine
learning.

To illustrate the modeling of a text as a network, the pre-processing steps (a)
and (b) are applied to a short extract from the book “Adventures of Sally”,
by P.G. Wodehouse:


Original text: “If Sally had been constantly in Bruce Carmyle’s thoughts
since they had parted on the Paris express, Mr. Carmyle had been very
little in Sally’s—so little, indeed, that she had had to search her memory
for a moment before she identified him”.


(a) Removal of stopwords and punctuation marks: Sally constantly 
Bruce Carmyle thoughts parted Paris express Carmyle little Sally little 
search memory moment before identified 

(b) Lemmatization: Sally constant Bruce Carmyle think part Paris express 
Carmyle little Sally little search memory moment before identify 

















Diego Raphael Amancio 


The network obtained from the pre-processed text is shown in Figure 2.
Note that, after step (b), each distinct word becomes a distinct node and edges
are established between adjacent words.
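
The construction described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the function names are hypothetical, the input is assumed to be an already pre-processed (stopword-free, lemmatized) list of tokens, and self-loops between repeated adjacent words are ignored by assumption.

```python
def build_word_adjacency_network(tokens):
    """Return an undirected, unweighted network as {word: set of neighbors}."""
    # Every distinct word becomes a node, even if it ends up without edges.
    adjacency = {word: set() for word in tokens}
    # Consecutive tokens in the pre-processed text are linked by an edge.
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # assumption: self-loops are not stored
            adjacency[left].add(right)
            adjacency[right].add(left)
    return adjacency


def degree(adjacency, word):
    """Node degree k(i): number of distinct neighbors of `word`."""
    return len(adjacency[word])


# Lemmatized extract from step (b) of the example, lowercased.
tokens = (
    "sally constant bruce carmyle think part paris express carmyle "
    "little sally little search memory moment before identify"
).split()
network = build_word_adjacency_network(tokens)
```

Storing neighbors in sets keeps the network unweighted, so a pair of words that is adjacent several times in the text still contributes a single edge.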





Fig. 2 Example of network obtained from the text: “If Sally had been constantly in Bruce 
Carmyle’s thoughts since they had parted on the Paris express, Mr. Carmyle had been very 
little in Sally’s—so little, indeed, that she had had to search her memory for a moment before 
she identified him”. 


4.2 Topological measurements of complex networks 

The topological characterization of complex networks can be performed by
computing topological measurements. Currently, there exist several
topological measurements [16]. In this study, the measurements usually employed for
textual analysis were chosen to characterize the topological attributes of text
networks. A brief description of each measurement is provided below.

— Average neighbor degree: this local measurement quantifies the average
connectivity of the neighbors of a node:

$$k_n(i) = \frac{1}{k(i)} \sum_j a_{ij} k(j). \qquad (1)$$

— Clustering coefficient: the clustering coefficient ($C$) is a quasi-local
measurement that quantifies the density of links between neighbors.
Mathematically, the clustering coefficient is defined as $C = 3 n_\Delta / n_3$, where

$$n_\Delta = \sum_{k>j>i} a_{ij} a_{ik} a_{jk}, \qquad (2)$$

$$n_3 = \sum_{k>j>i} (a_{ij} a_{ik} + a_{ji} a_{jk} + a_{ki} a_{kj}). \qquad (3)$$

In textual applications, the clustering coefficient of specific words tends to
quantify the number of distinct contexts in which the word appears [35].

— Accessibility: the accessibility (or diversity) ($\alpha$) is an extension of the
degree that is based on both topology and dynamics of networks [26]. This
centrality index is relevant to identify topological high-degree nodes that
effectively access only a few neighbors. To define this measurement,
let $p^{(h)}_{ij}$ be the likelihood of a random
walker going from node $i$ to node $j$ in $h$ steps. The accessibility is computed
as the irregularity of the distribution of $p^{(h)}_{ij}$:

$$\alpha^{(h)}(i) = \exp\Big( - \sum_j p^{(h)}_{ij} \ln p^{(h)}_{ij} \Big). \qquad (4)$$

This measurement has been employed to detect the border of complex
networks [51]. In textual applications, the accessibility has been useful to
identify core concepts, thus allowing the construction of informative
automatic summarizers [26].

— Average shortest path length: the average shortest path length ($l$) quantifies
the typical distance between two nodes of the network. This measurement
was employed because it has been useful in textual applications [35,52]. In
word adjacency networks, this measurement has proven relevant to identify
keywords, even if they are not very frequent [35].

— Betweenness: the betweenness ($B$) is a centrality measurement. This
means that the highest values of betweenness are assigned to the most
relevant concepts in word adjacency networks. This measurement quantifies
how easily a node can be accessed, provided that walks are performed via
shortest paths. Let $g^{(m)}_{ij}$ be the number of shortest paths between nodes
$i$ and $j$ passing through node $m$. If $g_{ij}$ is the total number of shortest
paths between $i$ and $j$ (passing through any intermediary node), then the
betweenness is defined as

$$B(m) = \sum_i \sum_j \frac{g^{(m)}_{ij}}{g_{ij}}. \qquad (5)$$

In textual networks, the betweenness quantifies the number of distinct
contexts of a given word [35]. Unlike the clustering coefficient, this
measurement uses the global network connectivity to infer the number of
contexts [35].

— Assortativity: the assortativity ($r$) quantifies degree-degree correlations [54].
In other words, it measures the tendency of nodes with similar degree to
be connected. Mathematically, it can be defined as

$$r = \frac{M^{-1} \sum_{j>i} k(i) k(j) a_{ij} - \Big[ M^{-1} \sum_{j>i} \tfrac{1}{2} \big( k(i) + k(j) \big) a_{ij} \Big]^2}{M^{-1} \sum_{j>i} \tfrac{1}{2} \big( k(i)^2 + k(j)^2 \big) a_{ij} - \Big[ M^{-1} \sum_{j>i} \tfrac{1}{2} \big( k(i) + k(j) \big) a_{ij} \Big]^2}, \qquad (6)$$

where $M$ is the total number of edges of the network.
Networks whose assortativity takes positive values are referred to as
assortative networks. On the other hand, if there is a negative correlation between
the degrees of linked nodes, then the network is disassortative. In word
adjacency networks, a disassortative behavior arises even when stopwords are
removed from the analysis [55].
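
To make some of the measurements above concrete, the following Python sketch computes the average neighbor degree, the global clustering coefficient and the accessibility on a network stored as a dictionary of neighbor sets. It is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, and the accessibility is computed with a plain (non self-avoiding) random walk, which is only one of the possible walk dynamics.

```python
import math


def average_neighbor_degree(adj, i):
    """k_n(i): mean degree of the neighbors of node i."""
    if not adj[i]:
        return 0.0
    return sum(len(adj[j]) for j in adj[i]) / len(adj[i])


def clustering_coefficient(adj):
    """Global clustering: 3 * (number of triangles) / (number of connected triples)."""
    closed = 0   # each triangle is counted once per corner, i.e. 3 times in total
    triples = 0  # connected triples (paths of length two), counted by center node
    for neighbors in adj.values():
        neighbors = list(neighbors)
        k = len(neighbors)
        triples += k * (k - 1) // 2
        for a in range(k):
            for b in range(a + 1, k):
                if neighbors[b] in adj[neighbors[a]]:
                    closed += 1
    return closed / triples if triples else 0.0


def accessibility(adj, i, h=2):
    """exp(entropy) of the distribution of a walker's position after h steps from i."""
    probs = {i: 1.0}
    for _ in range(h):
        nxt = {}
        for node, p in probs.items():
            if not adj[node]:  # isolated node: the walker stays where it is
                nxt[node] = nxt.get(node, 0.0) + p
                continue
            share = p / len(adj[node])  # uniform choice among neighbors (assumption)
            for nb in adj[node]:
                nxt[nb] = nxt.get(nb, 0.0) + share
        probs = nxt
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(entropy)
```

The accessibility equals the number of reachable nodes when all of them are reached with the same probability, and drops below that number when the walker concentrates on a few of them.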


4.3 Pattern recognition methods 


In a supervised classification task, the objective is to automatically
distinguish objects (or instances) according to their classes. The characterization
of each object is made with object attributes (or features). In this study, the
goal is to distinguish between two classes: (i) the “real” class, which includes
real scientific papers; and (ii) the “fake” class, which encompasses the papers
automatically generated by the SCIGen algorithm. As features, I chose the
network measurements described in Section 4.2. The following pattern
recognition methods were employed in this study:

— Naive Bayes (NBY): the naive Bayes classifier uses the Bayesian optimal
decision rule to classify an object. The class $c'$ is chosen if the condition

$$P(c'|m) > P(c_k|m) \qquad (7)$$

holds for each $c_k \neq c'$, where $P(c_k|m)$ is the likelihood of class $c_k$ to appear
in the context represented by the set of network measurements $m$. In most
cases, the exact behavior of $P(c_k|m)$ is unknown. To overcome this issue,
Bayes’ theorem can be used:

$$c' = \arg\max_{c_k} P(c_k|m) = \arg\max_{c_k} \frac{P(m|c_k)}{P(m)} P(c_k)
     = \arg\max_{c_k} P(m|c_k) P(c_k) = \arg\max_{c_k} [\log P(m|c_k) + \log P(c_k)]. \qquad (8)$$

Assuming attribute independence in eq. 8 and considering that the
topological context is given by a set of network measurements $m = \{m_1, m_2, \ldots\}$,
then $P(m|c_k)$ can be written as

$$P(m|c_k) = P(\{m_i \,|\, m_i \in m\} | c_k) = \prod_{m_i \in m} P(m_i|c_k). \qquad (9)$$

Therefore, the most likely class $c'$ associated to the unknown instance is

$$c' = \arg\max_{c_k} \Big[ \log P(c_k) + \sum_{m_i \in m} \log P(m_i|c_k) \Big]. \qquad (10)$$

To illustrate the decision process, consider Figure 3. The position of each
circle on the x-axis represents the value obtained for a given measurement.
Distinct colors represent different classes ($c_1$ = “blue” and $c_2$ = “red”).
Considering that the frequency of occurrence in the dataset of each class is
the same, the term $\log P(c_k)$ can be disregarded in eq. 10. The remaining
term, the likelihood $P(m_i|c_k)$, can be estimated via the Parzen window
method [56]. Therefore, the decision boundaries are established according
to

$$c' = \arg\max_{c_k} P(m|c_k). \qquad (11)$$

Fig. 3 Example of classification between two classes (red and blue) using the Naive Bayes
algorithm. The probability distribution of each class is used to create decision boundaries.

— Nearest neighbors (KNN): in this algorithm, the classification of an
unknown instance is performed with a voting process which considers the
$k$ nearest neighbors. If most of the $k$ nearest neighbors belong to the class
$c_k$, then $c_k$ is associated to the unknown instance. In Figure 4, the
innermost dashed circle represents the set of instances used for the voting
process when $k = 5$. In this case, the class associated to the unknown
instance represented by a question mark (?) is the red class. If the $k = 12$
nearest neighbors are chosen for the voting process, the most frequent class
becomes the blue class. Finally, if $k = 19$, the class associated to the
unknown instance is the blue class. In this paper, the value $k = 1$ was used
since this value usually provides the highest accuracy rates [57].

Fig. 4 Classification of the unknown instance (central question mark) using the kNN
algorithm. If the innermost dashed circle is used ($k = 5$), then the class associated to the
unknown instance is the reddish one.


— Decision trees (C45): this method uses a tree as a data structure to
represent the emergent patterns of the dataset [58]. More specifically, in a
decision tree, each node represents an attribute and edges correspond to
tests performed on attributes (see Figure 5). The decision process starts
at the root (i.e. the node with no parents). When a leaf node is reached,
the class associated to that node is selected. The generation of a decision
tree requires the definition of a measure that is able to identify the most
informative attribute at each step of the algorithm. In
this paper, I used the Kullback-Leibler divergence [59]. Mathematically,
the Kullback-Leibler divergence $\Omega(S_{tr}, m_i)$ of the attribute $m_i$ computed
in the training dataset $S_{tr}$ is

$$\Omega(S_{tr}, m_i) = \mathcal{H}(S_{tr}) - \mathcal{H}(S_{tr}|m_i), \qquad (12)$$

where $\mathcal{H}(S_{tr})$ is the entropy computed in the training dataset $S_{tr}$ and
$\mathcal{H}(S_{tr}|m_i)$ is the entropy of the dataset when the value of $m_i$ is specified.
Particularly, $\mathcal{H}(S_{tr}|m_i)$ can be computed from $S_{tr}$ as

$$\mathcal{H}(S_{tr}|m_i) = \sum_{v \in V(m_i)} \frac{|\{\rho_{tr} \in S_{tr} \,|\, m_i = v\}|}{|S_{tr}|}
  \, \mathcal{H}(\{\rho_{tr} \in S_{tr} \,|\, m_i = v\}), \qquad (13)$$

where $V(m_i)$ is the set of all values taken by the attribute $m_i$ in the training
dataset.
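
The attribute-selection measure used to grow the tree (the entropy of the training set minus the entropy that remains once the attribute value is known) can be illustrated with a short Python sketch. The function names are hypothetical, and the attribute is assumed to take a small number of discrete values; continuous network measurements would first have to be split by a threshold, a step that is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the training set minus its entropy conditioned on the attribute."""
    total = len(labels)
    conditional = 0.0
    for v in set(values):
        # Labels of the training instances whose attribute equals v.
        subset = [lab for val, lab in zip(values, labels) if val == v]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional
```

An attribute that separates the two classes perfectly yields a gain equal to the entropy of the training set, whereas an attribute that is unrelated to the classes yields a gain of zero.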



Fig. 5 Example of a decision tree. To decide the class of an unknown instance, consider that
the attributes are $F_1 = -0.10$, $F_2 = 0.11$ and $F_3 = 0.38$. The decision process starts at the
leftmost node, the root. The first test leads to the edge “NO” and the second test leads to
the edge labeled as “YES”. Therefore, the class associated to the unknown instance is the
class $c_2$.
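
The naive Bayes decision rule with Parzen-window likelihoods and the nearest neighbor rule with k = 1 can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments: the function names, the Gaussian kernel, the bandwidth value and the Euclidean distance are all assumptions made for the example, and equal class priors are assumed so that the prior term can be dropped.

```python
import math


def parzen_density(x, samples, bandwidth=0.1):
    """Gaussian-kernel Parzen estimate of the density of one feature at value x."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    kernel_sum = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return kernel_sum / (len(samples) * norm)


def naive_bayes_predict(instance, training, bandwidth=0.1):
    """training maps each class label to a list of feature vectors."""
    best_label, best_score = None, -math.inf
    for label, vectors in training.items():
        score = 0.0
        for idx, x in enumerate(instance):
            column = [v[idx] for v in vectors]
            # Sum of log-likelihoods, one term per (assumed independent) feature.
            score += math.log(parzen_density(x, column, bandwidth) + 1e-300)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def nearest_neighbor_predict(instance, training):
    """1-NN: return the class of the closest training vector (Euclidean distance)."""
    best_label, best_dist = None, math.inf
    for label, vectors in training.items():
        for v in vectors:
            d = math.dist(instance, v)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


# Toy feature vectors (hypothetical values, two features per manuscript).
training = {
    "real": [(0.9, 1.1), (1.0, 1.0), (1.1, 0.9)],
    "fake": [(0.2, 0.3), (0.3, 0.2), (0.25, 0.25)],
}
```

The small constant added inside the logarithm only guards against a zero density far from every training sample; it does not change the ranking of the classes in ordinary cases.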


4.4 Quantifying feature relevance 

The method employed for quantifying feature relevance assigns high values of
relevance to a given attribute if its use usually yields high-quality classifiers.
More specifically, this method counts the frequency of appearance of each
feature among the best classifiers, when one analyzes all possible combinations
of features. Let $F$ be a set comprising $n_f$ features. Using $F$, it is possible to
generate $n_c = 2^{n_f}$ distinct combinations of features. To quantify the relevance
of each feature, the $n_c$ combinations are sorted in decreasing order according to
the accuracy rate provided by each combination. Suppose that $\Theta = \{\theta_{ij}\}$
represents the ordered set of combinations, where

$$\theta_{ij} = \begin{cases}
1 & \text{if the } i\text{-th best combination employed the } j\text{-th feature,} \\
0 & \text{otherwise.}
\end{cases} \qquad (14)$$

Then $\Theta$ can be used to verify if a feature $j$ tends to appear among the best
classifiers. This can be done by defining the function $f(x)$ as

$$f(x) = \sum_{i=1}^{x} \theta_{ij}, \qquad \{x \in \mathbb{N}^* \,|\, x \leq n_c\}. \qquad (15)$$

Note that $f(x)$ increases quickly whenever $j$ is frequent among the best
combinations of features. Conversely, if a given feature $j$ is more frequent among
the worst classifiers, $f(x)$ increases significantly only for high values of the
domain. Therefore, the prominence $\rho(j)$ of feature $j$ can be computed as the
area underneath the curve $f(x)$:

$$\rho(j) = \int f(x)\,dx = \sum_{i=1}^{n_c} \sum_{k=1}^{i} \theta_{kj}. \qquad (16)$$
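
The ranking procedure described in this section can be sketched in Python as follows. The sketch is illustrative only: `feature_prominence` and `accuracy_of` are hypothetical names, `accuracy_of` stands for whatever routine trains and evaluates a classifier on a given combination of features, and only non-empty combinations are enumerated, which is an assumption of this example.

```python
from itertools import combinations


def feature_prominence(features, accuracy_of):
    """Area under f(x) for each feature, where f(x) counts how many of the
    x best feature combinations contain that feature."""
    subsets = [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]
    # Best combination first (decreasing accuracy).
    ranked = sorted(subsets, key=accuracy_of, reverse=True)
    prominence = {}
    for feat in features:
        running = 0  # f(x): occurrences of feat among the x best combinations
        area = 0     # discrete area under f(x)
        for combo in ranked:
            running += feat in combo
            area += running
        prominence[feat] = area
    return prominence


# Toy accuracies for two hypothetical features "a" and "b".
toy_accuracy = {("a",): 0.8, ("b",): 0.6, ("a", "b"): 0.9}
scores = feature_prominence(["a", "b"], toy_accuracy.get)
```

In this toy case feature "a" appears in the two best combinations, so its curve rises earlier and its area is larger than that of "b". Since the number of combinations grows exponentially with the number of features, such an exhaustive enumeration is only practical for small feature sets.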


5 Results 


In this study, the styles of real and fake manuscripts were compared. As fake
manuscripts, I considered the texts generated by the SCIGen algorithm, which
produces scientific manuscripts using a proper grammar. The style of the
SCIGen papers was compared with the style of real manuscripts recovered from
the following sources: (a) the Pattern Recognition Letters journal (PRL) [60];
(b) the arXiv repository comprising Computer Science papers (arXiv/cs) [61];
and (c) the Journal of Informetrics (JI) [62]. Four hundred manuscripts were
used in the experiments. Note that most of the measurements presented in
Section 4.2 are local measurements. Therefore, each node is associated to a
specific value. To characterize each manuscript, I used the global distribution
of measurements for all the words in the manuscript. Here, the goal is to
obtain quantities characterizing relevant factors of the distributions to be used
as global measurements. Using the same strategy of previous studies [37,52],
the average $\langle X \rangle$ and the deviation $\Delta X$ of each local measurement $X$ were
extracted. Therefore, the features employed to characterize the style of the
manuscripts were

$\langle \alpha^{(h=2)} \rangle$, $\Delta\alpha^{(h=2)}$, $\langle \alpha^{(h=3)} \rangle$, $\Delta\alpha^{(h=3)}$,
$\langle k_n \rangle$, $\Delta k_n$, $\langle B \rangle$, $\Delta B$, $\langle C \rangle$, $\Delta C$, $r$, $\langle l \rangle$
and $\Delta l$.

To minimize the correlation of the above measurements with the frequency
of words, the following normalization was applied. Let $X$ be the value of a
given measurement obtained in a text and $\langle X^{(R)} \rangle$ the average value of the
same measurement obtained in 20 randomized versions of the text. Then, the
normalized measurement is computed as

$$\tilde{X} = \frac{X}{\langle X^{(R)} \rangle}. \qquad (17)$$
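
The normalization step can be sketched in Python as shown below. This is a hedged illustration: the function names are hypothetical, `measure` stands for any routine that maps a token sequence to a single number, and the randomized versions are assumed here to be obtained by shuffling the word order, which preserves word frequencies but is not necessarily the exact randomization used in the study.

```python
import random


def normalized_measurement(tokens, measure, n_shuffles=20, seed=0):
    """Value of `measure` on the text divided by its mean over shuffled versions."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    observed = measure(tokens)
    shuffled_values = []
    for _ in range(n_shuffles):
        shuffled = list(tokens)
        rng.shuffle(shuffled)  # same words and frequencies, random order
        shuffled_values.append(measure(shuffled))
    baseline = sum(shuffled_values) / n_shuffles
    return observed / baseline


def count_edges(tokens):
    """Example measure: number of distinct adjacent word pairs (network edges)."""
    return len({frozenset(pair) for pair in zip(tokens, tokens[1:]) if pair[0] != pair[1]})


sample = "a b a b a b c d".split()
ratio = normalized_measurement(sample, count_edges)
```

A measurement that depends only on word frequencies, such as the vocabulary size, is left unchanged by shuffling and therefore normalizes to exactly one; values that depart from one reflect the way words are ordered in the text.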


After characterizing the topological structure of the manuscripts, the
hypothesis that real and fake manuscripts yield distinct network properties was
probed. A twofold method was employed to accomplish the identification of
fake papers: a univariate and a multivariate approach.


5.1 Univariate analysis 

In this approach, the discriminability of real and fake papers was analyzed by
considering just a single measurement for each classifier generated. The
accuracy rate obtained in each dataset is shown in Table 1. The discrimination of
PRL and SCIGen papers was accomplished with an accuracy of 79% in the
best scenario, when the average neighbor degree $\langle k_n \rangle$ or the average
accessibility $\langle \alpha^{(h=3)} \rangle$ was employed along with the decision tree (C4.5)
algorithm. The accurate discrimination between arXiv/cs and SCIGen papers
could be performed in 88% of the cases. This accuracy rate was obtained with
the standard deviation of the average neighbor degree, $\Delta k_n$. The highest
accuracy rate occurred when distinguishing JI and SCIGen papers. In this case,
91% of the papers could be successfully discriminated. Taken together, these
results confirm that the networked representation of texts is useful to
distinguish real manuscripts from those automatically generated by SCIGen. In
particular, it is possible to note that SCIGen texts are more similar to the PRL
manuscripts, which can be explained by both content and structural
similarities, because both datasets comprise letters about computer science issues.
While arXiv/cs also comprises Computer Science papers, the format allowed by
this repository is much more generic than the structural format generated by
the SCIGen algorithm. Hence, as expected, a larger discriminability was found
when SCIGen and arXiv/cs were compared. When comparing JI and SCIGen, an
even larger distinguishability was obtained, probably because both structural
and semantic contents are distinct.

Table 1 Accuracy rate (%) obtained for each measurement in the univariate approach. The
best discriminability was found when comparing JI and SCIGen papers.

                                                      PRL          arXiv.org/cs         JI
Measurement                                       KNN NBY C45      KNN NBY C45      KNN NBY C45
Accessibility $\langle\alpha^{(h=2)}\rangle$       72  78  78       74  79  78       74  80  81
Accessibility $\Delta\alpha^{(h=2)}$               44  62  48       56  64  55       49  60  49
Accessibility $\langle\alpha^{(h=3)}\rangle$       72  77  79       66  73  73       72  76  75
Accessibility $\Delta\alpha^{(h=3)}$               44  61  48       83  88  87       86  91  91
Avg. N. Degree $\langle k_n \rangle$               72  77  79       66  63  57       70  78  75
Avg. N. Degree $\Delta k_n$                        45  62  48       83  86  88       85  90  90
Betweenness $\langle B \rangle$                    66  77  78       68  77  77       69  77  74
Betweenness $\Delta B$                             61  71  66       47  64  64       61  62  58
Clustering $\langle C \rangle$                     50  46  49       57  50  50       50  64  53
Clustering $\Delta C$                              58  58  54       50  55  55       54  68  63
Assortativity $r$                                  59  76  74       71  71  72       73  79  78
Shortest paths $\langle l \rangle$                 63  71  65       62  73  71       66  75  69
Shortest paths $\Delta l$                          56  58  59       48  64  60       67  68  68

The individual performance of the attributes employed in the univariate
analysis can be summarized as follows:

— Accessibility: the average accessibility presented an average
discriminative ability. The average accessibility at the third level was
particularly useful in the PRL dataset, since the highest accuracy was found when
$\langle \alpha^{(h=3)} \rangle$ was employed with the C4.5 method. The deviation
$\Delta\alpha^{(h=3)}$ proved especially relevant to identify real papers in the
arXiv/cs and JI datasets.


— Neighbors degree: an excellent performance was found for $\Delta k_n$ in
the arXiv/cs and JI datasets. Conversely, the average $\langle k_n \rangle$ performed well
mainly in the PRL dataset.

— Betweenness: the average $\langle B \rangle$ turned out to be more relevant than the
deviation $\Delta B$. Nevertheless, the use of the betweenness as a feature yielded
relatively low accuracy rates.

— Clustering coefficient: this measurement yielded low accuracy rates in
all three datasets. This means that the fraction of links between neighbors
is not relevant for this task. The most relevant links, therefore, are those
connecting neighbors and further hierarchies.

— Assortativity: in most cases, this measurement presented an average
performance.

— Shortest paths: the average $\langle l \rangle$ was found to be more informative than
the deviation $\Delta l$. The best performance achieved with the shortest path
length, however, was only 71%.

Even though the univariate analysis is able to identify which attributes are 
more useful to discriminate specific classes, this analysis does not take into 
consideration the inter-relationship between different attributes. Because the 
interaction of attributes may improve the quality of the classifiers, in the next 
section, I approach the classification task as a multivariate problem. 


5.2 Multivariate analysis 

In the multivariate analysis, all 13 measurements were combined and applied
as features of the classifiers. A two-dimensional projection of the data using
the principal component analysis technique [53] is shown in Figure 6.
Interestingly, it is possible to note that the worst discrimination occurred in the
PRL dataset, as revealed by the large overlapping region. Conversely, a much
better discrimination was achieved with the arXiv.org/cs dataset. These
results are consistent with the patterns found when the univariate analysis was
performed. Another interesting pattern arising from the visualization provided
in Figure 6 concerns the variability of style of SCIGen papers. It is clear that
the style of SCIGen papers displays a lower variability when compared to the
style of real texts. This effect can be easily perceived, e.g. by observing that
SCIGen papers are scattered in a small region in Figure 6(b).

Table 2 Accuracy rate (%) obtained when distinguishing real (PRL, arXiv/cs or JI) from
artificial papers (SCIGen). Unlike the univariate analysis, all 13 topological measurements
were employed.

Dataset         KNN  NBY  C45
PRL              83   89   85
arXiv.org/cs     95   94   87
JI               95   95   95

The accuracy rates obtained with the multivariate classification are shown in
Table 2. When one compares the results obtained here with the ones achieved
with the univariate analyses, it is clear that the multivariate analysis
improved the discriminative ability of the classifiers. The accuracy rate in
the PRL dataset improved by 10 percentage points (from 79% to 89%). In the
arXiv/cs dataset, the accuracy went from 88% to 95%. The lowest increase in
accuracy occurred for the JI dataset, which had already provided an excellent
discriminability with the univariate approach. These results suggest that the
interaction of attributes is able to improve the identification of fake papers
generated by the SCIGen algorithm, especially if the separation between real
and fake papers is not so clear when a single measurement is employed to
generate the classifiers.
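
To illustrate why combining attributes can help when no single attribute
separates the classes, the sketch below evaluates a 1-nearest-neighbour
classifier with leave-one-out validation on contrived synthetic points. The
data, the function name loo_1nn_accuracy and the choice of k = 1 are
assumptions made for illustration only; they are not the datasets or the
classifier settings used in the experiments.

```python
def loo_1nn_accuracy(samples, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that
    only looks at the feature columns listed in feature_idx."""
    correct = 0
    for i, x in enumerate(samples):
        best_dist, best_label = None, None
        for j, y in enumerate(samples):
            if i == j:
                continue  # the held-out sample cannot vote for itself
            dist = sum((x[k] - y[k]) ** 2 for k in feature_idx)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        if best_label == labels[i]:
            correct += 1
    return correct / len(samples)


# Contrived data: the two classes lie on parallel lines, so each feature
# alone overlaps completely, while the pair of features separates them.
real = [(float(i), float(i + 3)) for i in range(10)]
fake = [(float(i), float(i)) for i in range(10)]
samples = real + fake
labels = ["real"] * len(real) + ["fake"] * len(fake)

acc_first = loo_1nn_accuracy(samples, labels, [0])    # first feature only
acc_second = loo_1nn_accuracy(samples, labels, [1])   # second feature only
acc_both = loo_1nn_accuracy(samples, labels, [0, 1])  # both features
```

In this toy setting the classifier that sees both features classifies every
point correctly, whereas either feature alone does not.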



Fig. 6 Principal component analysis obtained with all measurements. The highest
discriminability, as revealed by the size of the overlapping regions, occurs for the
J. Informetr. and arXiv datasets. The topological variability of the automatically
generated texts from SCIGen tends to be lower than the variability observed in real
manuscripts.
Table 3 Spearman’s rank correlation coefficient for distinct rankings of attributes. Note
that, in general, there is a strong correlation between the rankings obtained in the same
dataset.

Classifiers     PRL     arXiv/cs   JI
KNN and NBY     0.703   0.703      0.709
KNN and C45     0.648   0.698      0.916
NBY and C45     0.192   0.654      0.640


Although all attributes have been used as input to the machine learning
methods, only some of them are selected to generate a given model. This is
clear when one observes the decision tree shown in Figure 7, which summarizes
the patterns recognized in the PRL dataset. The relevance of each attribute
employed in the multivariate analysis was quantified with the technique
described in Section 4.4. The ranking obtained for each measurement in each
classifier is shown in Table 4. Note that there is a strong consistency
between rankings of measurements across distinct classifiers in the same
dataset. This consistency is confirmed by the high values of Spearman’s rank
correlation of rankings (see Table 3). The performance of each measurement
for the classification is discussed below.
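
As a worked check of the rank correlation, the sketch below applies the
standard Spearman formula for rankings without ties,
rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), to the KNN and NBY rankings reported
for the PRL dataset in Table 4 (13 measurements, listed in table order). The
function name spearman_from_ranks is a hypothetical helper, and the code is an
illustration rather than the procedure used to build Table 3.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items (no ties)."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Ranks of the 13 measurements in the PRL dataset, in the row order of Table 4.
prl_knn = [3, 9, 4, 10, 5, 11, 2, 1, 8, 12, 7, 6, 13]
prl_nby = [4, 9, 5, 10, 6, 11, 1, 3, 8, 7, 2, 13, 12]

rho = spearman_from_ranks(prl_knn, prl_nby)
print(round(rho, 3))  # 0.703
```

The squared rank differences sum to 108, which gives 1 - 648/2184, in
agreement with the KNN and NBY entry for PRL in Table 3.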


Fig. 7 Decision tree obtained to distinguish PRL from SCIGen manuscripts. This decision
tree is able to accurately identify the class (PRL or SCIGen) of the manuscripts in 85% of
the cases. Note that not all measurements were employed for the classification.


Table 4 Ranking of measurements based on the accuracy rates of the classifiers, where 1
means best, 2 second best and so forth. In this analysis, the multiple interactions between
features were considered. The results obtained for each classifier are shown for each
dataset considered. Note that, in general, the performance depends on the dataset.

                         P. Rec. Lett.     arXiv.org/cs      J. Informetr.
Measurement              KNN  NBY  C45     KNN  NBY  C45     KNN  NBY  C45
Accessibility              3    4    5       7   10   10       3    6    3
Accessibility              9    9    8      12   12    5      13   11   13
Accessibility              4    5    3       4    8    9       4    7    6
Accessibility             10   10    9       1    1    1       1    1    1
Avg. N. Degree ⟨kn⟩        5    6    2       5    9    6       5    8    5
Avg. N. Degree Δkn        11   11   10       2    2    2       2    2    2
Betweenness ⟨B⟩            2    1    1       3    3    3       6    5    7
Betweenness ΔB             1    3    6       6    4    4       9    3   12
Clustering ⟨C⟩             8    8   11       9    6    7      10   10    8
Clustering ΔC             12    7   12      10    7    8      11    9   10
Assortativity r            7    2   13      11    5   13       7    4    4
Shortest paths ⟨l⟩         6   13    4       8   11   11       8   13    9
Shortest paths Δl         13   12    7      13   13   12      12   12   11

— Accessibility: the performance of this measurement depends on the dataset.
The deviation turned out to be the best measurement to identify real papers
in the arXiv/cs and JI datasets. In contrast, in the PRL dataset, the average
performed better than the deviation: the best performance using accessibility
measurements in the PRL dataset was achieved with the average.

— Neighbors degree: an excellent performance was observed for the deviation
Δkn in both arXiv/cs and JI datasets. Note that Δkn reached second
place in both repositories. Particularly, in the PRL dataset, the average
⟨kn⟩ performed better than the deviation Δkn.

— Betweenness: the average ⟨B⟩ performed very well in all three datasets.
This suggests that this measurement becomes very discriminative when
combined with other attributes. Note that, in the univariate analysis, the
betweenness displayed low accuracy rates (see Table 1).

— Clustering coefficient: the combination with other attributes does not 
seem to improve the discriminability of this measurement. 

— Assortativity: the importance of this measurement depends on the dataset. 
The best performance, a second position, was achieved in the PRL dataset 
when the Naive Bayes classifier was used. 

— Shortest paths: the average ⟨l⟩ and especially the deviation Δl ranked
among the worst measurements. Therefore, similarly to the clustering
coefficient, the average shortest path length is not informative even when
associated with other measurements.

All in all, the combination of attributes improved the performance of the
classifications. The attributes with the highest discrimination ability were the
average betweenness ⟨B⟩ (PRL dataset) and the standard deviation of the
accessibility. Although some measurements turned out to be uninformative in
specific datasets, they can still be useful in other scenarios, as the
discriminability may depend on the data distribution. For this reason, the
clustering coefficient and the average shortest path length should be tried in
other datasets.
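
One simple way to read Table 4 is to average the rank of each measurement over
the three classifiers. The sketch below does this for the PRL and arXiv/cs
columns. This aggregation is an illustrative assumption and is not part of the
analysis reported above. The names RANKS, mean_ranks and best_measurement are
hypothetical, and the four accessibility rows are labelled only by their
position in the table.

```python
# Ranks (KNN, NBY, C45) copied from Table 4 for two of the three datasets.
RANKS = {
    "PRL": {
        "Accessibility (row 1)": (3, 4, 5),
        "Accessibility (row 2)": (9, 9, 8),
        "Accessibility (row 3)": (4, 5, 3),
        "Accessibility (row 4)": (10, 10, 9),
        "Avg. N. Degree <kn>": (5, 6, 2),
        "Avg. N. Degree dkn": (11, 11, 10),
        "Betweenness <B>": (2, 1, 1),
        "Betweenness dB": (1, 3, 6),
        "Clustering <C>": (8, 8, 11),
        "Clustering dC": (12, 7, 12),
        "Assortativity r": (7, 2, 13),
        "Shortest paths <l>": (6, 13, 4),
        "Shortest paths dl": (13, 12, 7),
    },
    "arXiv/cs": {
        "Accessibility (row 1)": (7, 10, 10),
        "Accessibility (row 2)": (12, 12, 5),
        "Accessibility (row 3)": (4, 8, 9),
        "Accessibility (row 4)": (1, 1, 1),
        "Avg. N. Degree <kn>": (5, 9, 6),
        "Avg. N. Degree dkn": (2, 2, 2),
        "Betweenness <B>": (3, 3, 3),
        "Betweenness dB": (6, 4, 4),
        "Clustering <C>": (9, 6, 7),
        "Clustering dC": (10, 7, 8),
        "Assortativity r": (11, 5, 13),
        "Shortest paths <l>": (8, 11, 11),
        "Shortest paths dl": (13, 13, 12),
    },
}


def mean_ranks(dataset):
    """Average rank of every measurement over the three classifiers."""
    return {name: sum(r) / len(r) for name, r in RANKS[dataset].items()}


def best_measurement(dataset):
    """Measurement with the lowest (best) average rank."""
    scores = mean_ranks(dataset)
    return min(scores, key=scores.get)


for dataset in RANKS:
    print(dataset, best_measurement(dataset))
```

Under this reading, the average betweenness has the best mean rank in the PRL
dataset, and the fourth accessibility row has the best mean rank in arXiv/cs.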


6 Conclusions 


In the current paper, I have investigated the hypothesis that artificially
generated manuscripts can be distinguished from real scientific papers via
topological characterization of complex networks. The combination of network
features (extracted from the word adjacency model) and machine learning methods
allowed the correct identification of SCIGen papers in 89% of the cases (worst
scenario). This means that there are hidden patterns in the organization of
papers generated by SCIGen that differ from the structural patterns arising
from real texts. Even though the techniques presented in this manuscript do
not outperform the methods based on textual content, they could be employed in
applications where the complementary nature of the proposed attributes plays
a prominent role in discriminating pieces of text with similar content [42,63].


The analysis of relevance of attributes revealed that the combination of
distinct topological attributes is the most successful approach. Concerning
the individual performance of topological features, the accessibility and the
betweenness performed particularly well, mainly in the multivariate analysis.
Conversely, the clustering coefficient and the shortest path length displayed
the poorest performance among the topological features employed. The results
presented here confirm, as a proof of principle, that the word adjacency model
can be useful to identify fake papers. Future work could pursue an improvement
of performance with a fine tuning of classifier parameters [57]. Another
possibility is to propose novel topological measurements or to combine the
techniques presented in this paper with traditional statistical natural
language processing methods [21].